Abstract. Information diffusion is a natural phenomenon in which information
propagates from node to node over a social network. Whether a node adopts a
piece of information in a social network can be affected by different factors.
Many previously proposed diffusion models consider one or several fixed
factors. However, the factors affecting the adoption decision of a node differ
from case to case and may not have been seen before. For a different diffusion
scenario with new factors, previous diffusion models may not model the
diffusion well, or may not be applicable at all. In this work, our aim is to
design a diffusion model in which the factors considered are easy to extend and
change. We further propose a framework for learning the parameters of the
model, which is independent of the factors considered. Therefore, with
different factors, our diffusion model can be adapted to more diffusion
scenarios without modifying the diffusion model or the learning framework. In
the experiment, we show that our diffusion model is very effective on the task
of activation prediction on a Twitter dataset.
1  Introduction

Information diffusion in social networks has been an active research field for
about a decade. It is a natural phenomenon in which information propagates from
node to node over a social network, much like an epidemic. Information
diffusion has many applications, such as promoting an idea more effectively
[5,10], blocking adverse opinions [3,11], or identifying information flows [13]
in a network. A well-known problem in this area is influence maximization,
formulated by Kempe et al. [10]. The problem of influence maximization is to
find a group of target nodes to be convinced of an idea initially so as to
maximize the spread size, i.e. the number of nodes adopting the idea, under a
given diffusion model.

To model how information diffuses in a network, researchers have proposed
various diffusion models from different perspectives. Among these models, the
Independent Cascade (IC) model and the Linear Threshold (LT) model [10] have
been widely employed in many applications and have several variants [1]. Both
models consider the influence strength of a neighbor. The key difference is
that, for a node to adopt an idea, IC considers only the influence of exactly
one activated neighbor, with uncertainty, while LT considers the collective
influence contribution of all activated neighbors. In other words, IC places
importance on which neighbor tries to persuade the node to adopt the idea,
whereas LT emphasizes the overall influence contribution from neighbors.
Nevertheless, the real world is so complicated that a single consideration can
hardly capture its complexity. Many factors may affect the adoption decision.
For example, an idea that has been adopted by most people has a better chance
of influencing someone [9], and an idea becomes harder to adopt as time passes.
Moreover, a person may have different levels of interest in different topics
[1]. However, the factors considered by previous diffusion models in social
networks are all fixed. For a different diffusion scenario with new factors,
previous diffusion models may not model the diffusion well, or may not be
applicable at all. Therefore, one usually has to propose a new diffusion model
that considers the new factors in order to model a specific scenario better.

Designing a diffusion model for each set of factors one by one, and proposing
the corresponding algorithms, e.g. for parameter learning and influence
maximization, is tedious. In this work, we aim to design a diffusion model
which can consider multiple factors flexibly, and we further propose a
framework for learning the parameters of the model, which relate to the
information transmission likelihood between nodes and the adoption prediction
of a node. To the best of our knowledge, no existing work takes the same
perspective. Specifically, we propose a Multiple Factors-Aware Diffusion (MFAD)
model which is able to flexibly consider multiple factors that may affect
adoption behaviors. MFAD is a two-stage propagation model. In the first stage,
called influence transmission, an activated node u tries to influence its
inactivated neighbor v with a probability. If the influence of u is
successfully transmitted to v, then in the second stage, called adoption
decision, v decides whether it becomes activated based on its own
considerations, predicted via its classification model trained on historical
adoption information. Unfortunately, due to the limitation of observation in
the real world, only positive instances are available to train the classifiers,
which makes it hard to achieve good performance. We therefore design a
mechanism to obtain unlabeled instances to help train nodes' classifiers, and
propose a learning framework to learn the classifiers and the transmission
probabilities between nodes. Our contributions are summarized as follows.

1. In our proposed MFAD model, factors are easy to extend and change, since
we employ a classification approach to predict the adoption behavior of
a node.

2. Our proposed learning framework is independent of the factors considered,
and we show in the experiment that the learning framework is effective.

3. Due to the limitation of observation on diffusion in the real world, it is
hard to achieve good results when predicting adoption behaviors. We explicitly
tackle this issue by learning nodes' classifiers for adoption decision with
only positive and unlabeled instances.
The remainder of the paper is organized as follows. We review the related
work in Section 2. We introduce our diffusion model in Section 3 and how to
learn the model in Section 4. In Section 5, we conduct experiments on a Twitter
dataset. Finally, we conclude in Section 6.

2  Related  Work 

In  this  section,  we  briefly  review  the  related  works  on  diffusion  models  in  social 
networks  and  learning  parameters  of  diffusion  models. 

Diffusion models describe how information spreads within a network. As
mentioned above, IC and LT are two classical models and have been widely
employed since they were connected to the influence maximization problem [10].
Recently, more factors of diffusion have been explored in the literature, e.g.
[1], [9] and [14], to name a few. N. Barbieri et al. [1] extend the IC model to
consider the topic distribution of items. T.-A. Hoang and E.-P. Lim [9] propose
a model considering three factors: user virality, user susceptibility and item
virality. Moreover, S.A. Myers et al. [14] explore not only internal influence
from activated nodes in a network, but also external influence from outside the
network. However, as discussed above, the factors considered by these diffusion
models are all fixed. For a different diffusion scenario with new factors,
these models may not model the diffusion well, or may not be applicable at all.

Although diffusion models in social networks were proposed long ago,
algorithms to learn the parameters of a diffusion model have been proposed only
recently. For example, K. Saito et al. [16] were the first to propose a
learning method for the IC model. Follow-up works mainly propose learning
methods for their own diffusion models [1]. Moreover, A. Goyal et al. [7]
propose the Credit Distribution model, which directly estimates spread size
from diffusion data without learning influence probabilities between nodes.
Since we aim to design a learning framework that is independent of the
diffusion factors considered, the above results do not apply to our scenario.

3  Proposed  Model 

In this work, our aim is to design a diffusion model which flexibly considers
multiple factors for information propagation. We propose the Multiple
Factors-Aware Diffusion (MFAD) model in this section.

Given a social graph $G = (V, E)$, where $V$ is the node set and $E$ is the
set of directed edges without multiple edges or self-loops, let $p_{u,v}$
denote the probability that node $u$ successfully transmits influence to a node
$v \in N_{out}(u)$, the out-neighbors of $u$, after $u$ is activated by an item
$i$. Let $f_v(x)$ denote a probabilistic classifier for node $v$, where
$f_v(x)$ considers multiple predefined features, i.e. factors, that affect the
tendency of $v$ to adopt item $i$ after $v$ is exposed to $i$, and $x$ is the
feature vector of the exposure. Note that a non-probabilistic classification
model whose outputs can be transformed into probabilistic outputs [15], e.g. an
SVM with Platt scaling, is also applicable as $f_v(x)$. The Multiple
Factors-Aware Diffusion (MFAD) model is defined as follows.
Fig. 1. The successful activation process of MFAD


Definition 1. Multiple Factors-Aware Diffusion (MFAD) model. The propagation
starts with initial seed nodes $S_0 \subseteq V$ which have adopted item $i$.
In the first timestamp $t_1$, each $u \in S_0$ tries to influence those of its
out-neighbors which are not activated by item $i$. The successful activation
probability for $v \in N_{out}(u)$ is calculated as $p_{u,v} f_v(x)$. The nodes
activated in $t_1$ are denoted as $S_1$. In timestamp $t_2$, the newly
activated nodes in $S_1$ try to activate their inactivated out-neighbors by the
same process. The process runs iteratively. If $S_j$ is empty in timestamp
$t_j$, the diffusion terminates. Note that the spread size can be expressed as
$|\bigcup_{0 \le i \le j-1} S_i|$ and that once a node becomes activated, it
never becomes inactivated again.
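
As an illustration of Definition 1, the following is a minimal Python sketch of
one random run of the MFAD process. It is not the implementation used in the
experiments. The function name `simulate_mfad`, the dictionary-based inputs and
the `features(u, v)` callback that builds the feature vector of an exposure are
all assumptions made for this example.

```python
import random


def simulate_mfad(out_neighbors, p, f, features, seeds, rng=None):
    """Run one random MFAD cascade and return the list [S_0, S_1, ...].

    out_neighbors: dict node -> iterable of out-neighbors (hypothetical layout)
    p:             dict (u, v) -> transmission probability p_{u,v}
    f:             dict v -> callable mapping a feature vector to [0, 1]
    features:      callable (u, v) -> feature vector x of the exposure
    seeds:         iterable of initially activated nodes S_0
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = set(seeds)
    rounds = [set(seeds)]
    while frontier:
        newly = set()
        for u in sorted(frontier):  # sorted only to make runs reproducible
            for v in out_neighbors.get(u, ()):
                if v in active or v in newly:
                    continue
                # Stage 1, influence transmission: succeeds with p_{u,v}.
                if rng.random() >= p[(u, v)]:
                    continue
                # Stage 2, adoption decision: v adopts with f_v(x).
                if rng.random() < f[v](features(u, v)):
                    newly.add(v)
        active |= newly
        if newly:
            rounds.append(newly)
        frontier = newly  # an empty S_j terminates the diffusion
    return rounds


def spread_size(rounds):
    """Spread size: the S_i are disjoint, so the union size is a plain sum."""
    return sum(len(s) for s in rounds)
```

Sampling the two stages separately gives the same activation probability
$p_{u,v} f_v(x)$ as in the definition, while keeping the two-stage structure
visible in the code.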

MFAD is a two-stage propagation model. An illustration of the successful
activation process in MFAD is shown in Figure 1. In the first stage, called
influence transmission, a node $u$ tries to influence each $v \in N_{out}(u)$
which is not activated, with probability $p_{u,v}$ in timestamp $t_{i-1}$,
where $u$ is activated in timestamp $t_{i-2}$. However, influence that is
successfully transmitted over the edge does not directly activate a node. Our
model has a subsequent stage, called adoption decision. In the second stage, if
the influence of $u$ has been successfully transmitted to $v$ with probability
$p_{u,v}$, $v$ receives this influence and decides whether it becomes activated
based on its considerations, predicted via its classification model $f_v(x)$.
If $v$ is activated, $v$ will try to influence its neighbors in the next
timestamp $t_i$. Consider a news item diffusing in an online social network,
e.g. Facebook or Twitter. A user u has recently posted a message about the
news. Due to the ranking mechanism designed by Facebook, or because there are
too many messages, a friend v of u may not see the message, which is modeled by
influence transmission. Moreover, even if the friend v reads the message, v may
consider whether to share or reply to the message based on several concerns,
e.g. u's interests and the importance of the news, which is modeled by adoption
decision. In contrast to the traditional diffusion models, MFAD can flexibly
consider multiple factors and takes a more microscopic view of information
propagation.

4  Two-Stage  Learning 

In this section, we propose a two-stage learning framework for MFAD, since
MFAD is a two-stage propagation model. In the first stage, the classifier of
each node is trained, which corresponds to adoption decision, while the
transmission probability between two connected nodes is estimated in the second
stage, which corresponds to influence transmission. We first introduce the
observed data.

Observed  Data.  A  propagation  trace  consists  of  activation  records.  Each  record 
{u,  v,  i,  t}  represents  that  node  u  adopts  an  item  i  at  timestamp  t  and  the  adop¬ 
tion  is  caused  by  u’s  in-neighbor  v.  If  u  is  actually  a  seed  of  the  item  in  the 
observed  data,  v  does  not  exist  and  is  set  to  be  NIL.  In  reality,  we  usually 
do  not  observe  that  a  node  fails  to  influence  others.  In  other  words,  we  would 
only  have  positive  instances  for  training  classifiers  directly  from  the  propagation 
trace.  We  next  discuss  how  to  learn  classifiers  of  nodes  in  such  a  situation. 

4.1  Learning  Classifiers  of  Nodes 

Due to the limitation of observation in the real world, only positive instances
are available to train the binary classifier of each node. However, a
classifier trained on only positive instances can hardly achieve good
performance. In fact, unlabeled instances can provide more information for
learning, e.g. the feature distribution, and can be generated by observing that
an inactivated out-neighbor of an activated node does not become activated in
the next timestamp. We use the term unlabeled instead of negative since an
unlabeled instance may be positive or negative due to the limitation of
observation. Thus, the task of the first-stage learning becomes training
classifiers using positive and unlabeled instances. In the literature [6,12],
this problem is called positive and unlabeled learning. Among previous work on
positive and unlabeled learning, C. Elkan and K. Noto [6] provide a principled
way of assigning weights to unlabeled instances. Based on their work [6], we
construct a framework to learn nodes' classifiers for MFAD. In the following,
we first describe how to obtain unlabeled instances from observed positive
records and then describe how to train a node's classifier based on positive
and unlabeled instances.

Obtaining Unlabeled Instances. With observed positive records, we can
easily analyze the whole propagation trace to get positive instances for
training nodes' classifiers. However, negative instances are hard to obtain for
two main reasons: (1) an item may not have been successfully exposed to a node
by its in-neighbor; (2) the observation window for a node may not be long
enough. Fortunately, we can generate unlabeled instances to help train a node's
classifier.

Assume that we have the complete propagation trace, which consists of positive
records in the format $(u, s, i, t, o = 1)$, where the binary variable $o$
indicates whether a record is an observed positive record in the trace or not.
If a node $u$ adopts an item $i$ from $s$ at timestamp $t$, we can observe
$u$'s out-neighbors who have not adopted $i$ from timestamp $t$ onward. For an
out-neighbor $v$ of $u$, if $v$ does not adopt $i$ in the complete propagation
trace, we generate an unlabeled record $(v, u, i, t', o = 0)$, where $t'$ is
the end of the observation time of the propagation trace. However, this
approach is costly and does not work if the trace is extremely large. For a
program to sequentially trace the positive records in
Algorithm 1. Unlabeled Record Generation

Require: graph G = (V, E), complete propagation trace D sorted in chronological
         order, time window T
Ensure: unlabeled records

Let M be a table to memorize possible unlabeled instances
for all (u, s, i, t, 1) in D do
    while M contains a record (u, *, i, *, 0) do
        remove the record from M
    for all v in u's out-neighbors do
        if v hasn't adopted i then
            insert an unlabeled record (v, u, i, t, 0) into M
    while the timestamp of the oldest record in M is less than t - T do
        output and remove the record from M
while |M| > 0 do
    output and remove the record from M

chronological order, it has to memorize all items that a node has not adopted
in order to generate unlabeled records at the end of tracing, which is
infeasible for a single machine with limited memory. Otherwise, multiple scans
are needed, which incur many disk I/O operations and are therefore
time-consuming.

As a more general approach, we propose Algorithm 1 to trace positive records
and generate unlabeled records. Let $T$ be the time window in which to observe
whether an out-neighbor $v$ of $u$, for a positive record $(u, s, i, t, 1)$,
adopts $i$ before $t + T$.

The main idea of the algorithm is that if $v$ does not adopt $i$ before
$t + T$, the algorithm generates an unlabeled record $(v, u, i, t, 0)$.
Although the pseudocode of Algorithm 1 is written in a batch style, it is easy
to adapt it to process positive records arriving as a stream. Note that a
feature vector $x$, i.e. an instance, is generated at the same time as a
positive record is traced or an unlabeled record is generated, in order to
capture the state of the exposure.
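
The window-based procedure of Algorithm 1 can be sketched in Python as follows.
This is only an illustrative sketch under assumptions: the function name
`generate_unlabeled`, the tuple layout of the records and the dictionary
`out_neighbors` are hypothetical, and the table M is kept as a simple in-memory
queue rather than an indexed structure.

```python
from collections import deque


def generate_unlabeled(out_neighbors, trace, window):
    """Generate unlabeled records (v, u, i, t, 0) from positive records.

    out_neighbors: dict node -> iterable of out-neighbors (hypothetical layout)
    trace:         positive records (u, s, i, t), sorted by timestamp t
    window:        the time window T
    """
    adopted = set()    # (node, item) pairs that have been adopted so far
    pending = deque()  # the table M, oldest candidate record first
    output = []
    for u, s, i, t in trace:
        adopted.add((u, i))
        # u has now adopted i: drop its pending records (u, *, i, *, 0).
        pending = deque(r for r in pending if not (r[0] == u and r[2] == i))
        # Each out-neighbor that has not adopted i becomes a candidate.
        for v in out_neighbors.get(u, ()):
            if (v, i) not in adopted:
                pending.append((v, u, i, t, 0))
        # Candidates older than t - T can no longer be cancelled in time.
        while pending and pending[0][3] < t - window:
            output.append(pending.popleft())
    # End of trace: flush the remaining candidates.
    output.extend(pending)
    return output
```

Because records are appended in chronological order, the oldest candidate is
always at the left end of the queue, so the expiry check only inspects the
head. The linear scan used for cancellation is a simplification; a real
implementation would likely index M by (node, item).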


Training a Node's Classifier. With the above approach, for a node $u$, we
can obtain positive instances $P(u)$ and unlabeled instances $U(u)$ for
training $u$'s classifier to predict the adoption tendency of an instance.
Given an instance $x$, the goal is to predict $p(a = 1 \mid x)$, where $a$ is a
binary random variable indicating whether the instance is positive ($a = 1$) or
negative ($a = 0$). Recall that $o$ is a binary variable indicating whether an
instance is observed ($o = 1$) or unlabeled ($o = 0$). By the lemma derived in
[6], $p(a = 1 \mid x) = p(o = 1 \mid x)/c$, where $c = p(o = 1 \mid a = 1)$ is
a constant value (footnote 1). Based on the lemma, the authors of [6] further
derive how to assign weights to instances rigorously, as follows. The weight of
a positive instance remains one, while each unlabeled instance has two copies:
one copy is a positive instance with weight $p(a = 1 \mid x, o = 0)$ and the
other copy is a negative instance with weight $1 - p(a = 1 \mid x, o = 0)$.

1  Due  to  the  space  limit,  we  omit  the  details  of  the  lemma.  If  readers  are  interested 
in  the  lemma  and  the  corresponding  results,  please  refer  to  [6]. 
Note that $p(a = 1 \mid x, o = 0) = \frac{1 - c}{c} \cdot \frac{p(o = 1 \mid x)}{1 - p(o = 1 \mid x)}$.
Thus, our task here splits into three subtasks: (1) learn $p(o = 1 \mid x)$,
(2) estimate $c$ and (3) learn $p(a = 1 \mid x)$.

Specifically, for a node $u$, (1) we first train a nontraditional classifier
$g_u(x) = p_u(o = 1 \mid x)$ on $P(u) - V(u)$ and $U(u)$, where the instances
in $V(u)$ are randomly selected from $P(u)$ and reserved as a validation set to
estimate $c$ (this $V(u)$ is not to be confused with the node set $V$). (2)
Next, $c$ is estimated as $\frac{1}{|V(u)|} \sum_{x \in V(u)} g_u(x)$ according
to [6]. (3) Finally, we construct positive instances $P'(u)$ and negative
instances $N(u)$ to train a traditional classifier
$f_u(x) = p_u(a = 1 \mid x)$ for the node $u$. $P'(u)$ contains the instances
in $P(u)$ and copies of the instances in $U(u)$, each with weight
$p_u(a = 1 \mid x, o = 0)$. $N(u)$ consists of copies of the instances in
$U(u)$, each with weight $1 - p_u(a = 1 \mid x, o = 0)$. In the experiments, we
use logistic regression to train both $g_u(x)$ and $f_u(x)$, since its output
probabilities are well calibrated [6], as the above procedure requires.
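
The weighting scheme of [6] used in steps (2) and (3) can be sketched in Python
as follows. The sketch assumes that the classifier $g_u$ has already been
trained and is available as a callable `g`. The function names and the clipping
of the weight to [0, 1] are assumptions for illustration, not part of the
original procedure.

```python
def estimate_c(g_on_validation):
    """Estimate c = p(o=1 | a=1) as the mean of g_u(x) over the validation set."""
    return sum(g_on_validation) / len(g_on_validation)


def unlabeled_positive_weight(g_x, c):
    """Return p(a=1 | x, o=0) = (1-c)/c * g(x)/(1-g(x)) for an unlabeled x."""
    if g_x >= 1.0:
        return 1.0
    w = (1.0 - c) / c * g_x / (1.0 - g_x)
    # Assumption: clip to [0, 1] because an imperfect g may overshoot.
    return min(max(w, 0.0), 1.0)


def weighted_training_set(positives, unlabeled, g, c):
    """Build (instance, label, weight) triples for training f_u.

    Each positive instance keeps weight 1. Each unlabeled instance is copied
    twice: once as positive with weight w and once as negative with 1 - w.
    """
    rows = [(x, 1, 1.0) for x in positives]
    for x in unlabeled:
        w = unlabeled_positive_weight(g(x), c)
        rows.append((x, 1, w))
        rows.append((x, 0, 1.0 - w))
    return rows
```

The resulting triples can be passed to any learner that accepts per-instance
weights, such as a weighted logistic regression.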

4.2  Learning  the  Transmission  Probability 

With the above $P(u)$, $U(u)$ and the trained classifier $f_v(x)$ for each node
$v \in V$, we now describe how to learn the transmission probability between
two connected nodes. Let $D = \bigcup_{u \in V} (P(u) \cup U(u))$ denote the
dataset for learning transmission probabilities. An instance $x \in D$ is in
the format $(f_1, f_2, \ldots, f_m)_{[u,v,o,t,i]}$, where $u$ is the node that
tries to activate $v$ by item $i$ before time $t$ ($x_o = 1$) or at time $t$
($x_o = 0$), $o$ is a binary variable indicating whether $v$ is activated
during the observation in Algorithm 1, and $f_1, f_2, \ldots, f_m$ are factors
of adoption, calculated for the exposure while running Algorithm 1. An instance
$x \in D$ is unlabeled if $x_o = 0$; otherwise, $x$ is positive.

We train the MFAD model by maximizing the likelihood of $D$ under the MFAD
model. Let $D_s$ denote the data of node $s$ in $D$, i.e.
$D_s = \{x \in D \mid x_v = s\}$, let $\theta$ denote all parameters of the
MFAD model to learn, i.e. all transmission probabilities, and let
$\theta_s = \{p_{q,s} \mid q \in N_{in}(s)\}$ denote the transmission
probabilities between node $s$ and its in-neighbors $N_{in}(s)$. Assuming
adoptions between nodes are independent, the complete data log-likelihood can
be expressed as follows.

$$ \log \mathcal{L}(\theta; D) = \sum_{s \in V} \log \mathcal{L}(\theta_s; D_s) \tag{1} $$

Note that since we want to learn the transmission probability between two
connected nodes, we exclude from $D$ any instance $x$ whose $x_u$ is NIL.
Moreover, since $f_s(x)$ for each node $s$ has already been trained above, the
data likelihood of each $D_s$ depends only on $\theta_s$. Maximizing Eq. (1) is
therefore equivalent to maximizing each $\mathcal{L}(\theta_s; D_s)$, i.e.

$$ \forall s \in V, \quad \max_{\theta_s} \log \mathcal{L}(\theta_s; D_s). \tag{2} $$

In reality, diffusion happens in continuous time, while MFAD is a discrete
time-based diffusion model. We include time constraints $\Delta^+$ and
$\Delta^-$ to decide the validity of an instance. Let
$D^+ = \{x \in D \mid x_o = 1\}$ and $D^- = \{x \in D \mid x_o = 0\}$. We
define $D_s^+$ and $D_s^-$ as follows.

$$ D_s^+ = \{x \in D_s \mid x_o = 1 \wedge \exists y \in D^+ (y_v = x_u \wedge y_i = x_i \wedge 0 < x_t - y_t < \Delta^+)\} \tag{3} $$

$$ D_s^- = \{x \in D_s \mid x_o = 0 \wedge \exists y \in D^+ (y_v = x_u \wedge y_i = x_i \wedge 0 < x_t - y_t < \Delta^-)\} \tag{4} $$

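Note that for an instance $x \in D^+$, $x_u \in N_{in}(x_v)$ should hold, where
$N_{in}(v)$ is $v$'s in-neighbor set. The data likelihood of node $s$ is then
defined as

$$ \mathcal{L}(\theta_s; D_s) = \prod_{q \in N_{in}(s)} \prod_{x \in D^+_{q,s}} \big(p_{q,s} f_s(x)\big) \prod_{q \in N_{in}(s)} \prod_{x \in D^-_{q,s}} \big(1 - p_{q,s} f_s(x)\big), \tag{5} $$

where $f_s(x) = p_s(a = 1 \mid x)$, $D^+_{q,s} = \{x \in D_s^+ \mid x_u = q\}$
and $D^-_{q,s} = \{x \in D_s^- \mid x_u = q\}$.

The data log-likelihood of node $s$ is:

$$ \log \mathcal{L}(\theta_s; D_s) = \sum_{q \in N_{in}(s)} \Big[ \sum_{x \in D^+_{q,s}} \log\big(p_{q,s} f_s(x)\big) + \sum_{x \in D^-_{q,s}} \log\big(1 - p_{q,s} f_s(x)\big) \Big] \tag{6} $$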


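To find $p_{q,s}$ by maximizing the above log-likelihood, let

$$ \frac{\partial \log \mathcal{L}(\theta_s; D_s)}{\partial p_{q,s}} = 0, \tag{7} $$

that is,

$$ \sum_{x \in D^+_{q,s}} \frac{1}{p_{q,s}} + \sum_{x \in D^-_{q,s}} \frac{-f_s(x)}{1 - p_{q,s} f_s(x)} = 0. \tag{8} $$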


Since no closed-form solution for Eq. (8) exists, we employ Brent's algorithm
[2], which uses a combination of golden section search and successive parabolic
interpolation. To obtain a good initial guess for $p_{q,s}$ so that the
algorithm converges fast, we apply the first-order Taylor series to approximate
$\frac{-f_s(x)}{1 - p_{q,s} f_s(x)}$ at $p_{q,s} = 0$ as
$-f_s(x) - f_s(x)^2 p_{q,s}$. Thus, Eq. (8) becomes

$$ \sum_{x \in D^+_{q,s}} \frac{1}{p_{q,s}} + \sum_{x \in D^-_{q,s}} \big[ -f_s(x) - f_s(x)^2 p_{q,s} \big] = 0 \tag{9} $$

and by some mathematical manipulation we get
$p_{q,s} = \frac{-C + \sqrt{C^2 + 4BD}}{2D}$, where $B = |D^+_{q,s}|$,
$C = \sum_{x \in D^-_{q,s}} f_s(x)$ and $D = \sum_{x \in D^-_{q,s}} f_s(x)^2$
(here $D$ denotes this sum, not the dataset). Note that $D$ should be a real
number greater than 0 and, obviously, $B$ and $C$ are non-negative real
numbers. In situations where this $p_{q,s}$ is not a valid probability value,
the value suggested by Brent's algorithm [2] is used instead.
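
The estimation of a single $p_{q,s}$ can be sketched in Python as follows. The
sketch is an illustration under assumptions: the function names are
hypothetical, and plain bisection on the derivative in Eq. (8) is used in place
of Brent's algorithm to keep the example short. Bisection is applicable because
the left-hand side of Eq. (8) is strictly decreasing in $p_{q,s}$.

```python
import math


def initial_guess(n_pos, f_neg):
    """Closed-form root of the Taylor-approximated Eq. (9), or None if invalid.

    n_pos: B, the number of positive instances on the edge (q, s)
    f_neg: the values f_s(x) of the unlabeled instances on the edge (q, s)
    """
    b = float(n_pos)
    c = sum(f_neg)
    d = sum(f * f for f in f_neg)
    if d <= 0.0:
        return None
    p = (-c + math.sqrt(c * c + 4.0 * b * d)) / (2.0 * d)
    return p if 0.0 < p <= 1.0 else None


def score(p, n_pos, f_neg):
    """Left-hand side of Eq. (8): derivative of the log-likelihood in p."""
    return n_pos / p - sum(f / (1.0 - p * f) for f in f_neg)


def solve_transmission_probability(n_pos, f_neg, iters=100):
    """Maximize the per-edge log-likelihood over p in [0, 1] by bisection."""
    if n_pos == 0:
        return 0.0  # no observed success: the likelihood is maximal at p = 0
    lo, hi = 1e-12, 1.0 - 1e-12
    if score(hi, n_pos, f_neg) >= 0.0:
        return 1.0  # likelihood still increasing at the upper boundary
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, n_pos, f_neg) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, with one positive instance and four unlabeled instances that all
have $f_s(x) = 0.5$, Eq. (8) reads $1/p = 2/(1 - 0.5p)$, whose exact root is
$p = 0.4$, while the closed-form initial guess is $\sqrt{2} - 1 \approx 0.414$.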

5  Experiments 

In this section, we conduct experiments on a Twitter dataset to evaluate the
effectiveness of our model on activation prediction. We first describe the
setup.
Table 1. Data Statistics

            training instances                        |            testing instances
  T   positive  unlabeled  unlabeled       all        |  positive  unlabeled  unlabeled       all
                 positive   negative                  |             positive   negative
  3     17,197    274,575    236,807   528,579        |   115,955     80,243     15,288   211,486
  6     17,197    270,467    236,807   524,471        |   114,463     78,788     15,940   209,191
 12     17,197    263,750    236,807   517,754        |   112,063     76,496     17,001   205,560
 24     17,197    252,274    236,807   506,278        |   108,180     72,864     18,893   199,937
 48     17,197    233,820    236,807   487,824        |   101,010     66,049     20,705   187,764
 96     17,197    204,919    236,807   458,923        |    88,807     54,793     24,520   168,120


5.1  Setup 

Dataset. We use the real dataset collected from Twitter by L. Weng et al.
[17]. We use standard preprocessing steps, similar to the steps used in [1], to
clean the diffusion data. However, in order to obtain enough training data for
training nodes' classifiers, we allow multiple activation records of the same
item for a node. After the preprocessing, each node has at least 20 activation
records, i.e. retweets and tweets with hashtags in Twitter, and each item, i.e.
hashtag, is adopted by at least 20 nodes. Moreover, there is no isolated node
left. The remaining social graph consists of 24,045 nodes and 871,745 directed
edges. The remaining activation records contain 8,427 different items and the
number of all activation records is 1,105,316. The dataset spans from March 23
to April 25, 2012, approximately one month.

Factors. We first define some notation. Let $deg_{in}(v)$ and $deg_{out}(v)$
denote $v$'s in-degree and out-degree in the graph. For an item $i$, we use
$t_{glo}(i)$ to denote the earliest time at which $i$ is adopted by some node
in the data and $t_{loc}(i, v, t)$ to denote the earliest time at which $i$ is
exposed to $v$ by an in-neighbor of $v$ that adopts $i$ before time $t$. Let
$node_{glo}(i, t)$ denote all nodes activated by item $i$ before time $t$ and
$node_{loc}(i, t, v)$ denote the in-neighbors of $v$ which are activated by
item $i$ before time $t$. For a directed edge from $u$ to $v$, we use
$ratio_{from}(v, u, t)$ to denote the ratio of $v$'s adoptions that are caused
by $u$ and $ratio_{same}(v, u, t)$ to denote the ratio of $v$'s adopted items
that are the same as $u$'s adopted items before time $t$.

For an instance $x = (f_1, f_2, \ldots, f_m)_{[u,v,o,t,i]}$, where node $u$
tries to activate $v$ by item $i$ before time $t$ ($x_o = 1$) or at time $t$
($x_o = 0$), the features $f_1, f_2, \ldots, f_m$ are of three types:
structure-based, time-based and history-based features. Structure-based
features include $deg_{in}(u)$, $deg_{out}(u)$, $deg_{in}(v)$, $deg_{out}(v)$
and the number of common neighbors between $u$ and $v$. Time-based features are
$t - t_{glo}(i)$, $t - t_{loc}(i, v, t)$ and $t(i, u) - t$, where $t(i, u)$ is
the time at which node $u$ adopts the item; if it is unavailable in the
dataset, we assume $t(i, u) - t = 0$. The first two reflect global and local
freshness. The last one measures the adoption latency. History-based features
are $|node_{glo}(i, t)|$, $|node_{loc}(i, t, v)|$, $ratio_{from}(v, u, t)$ and
$ratio_{same}(v, u, t)$. Thus, there are $m = 12$ features in total for
training a node's classifier in the experiment.
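
As a small illustration of the structure-based features, the following Python
sketch computes the five values for an edge from $u$ to $v$. The function name
and the adjacency dictionaries are hypothetical. The text does not specify how
common neighbors are counted in a directed graph, so the sketch assumes that a
node's neighbors are the union of its in-neighbors and out-neighbors.

```python
def structure_features(in_neighbors, out_neighbors, u, v):
    """Return [deg_in(u), deg_out(u), deg_in(v), deg_out(v), #common neighbors].

    in_neighbors / out_neighbors: dict node -> iterable of neighbors
    (hypothetical layout; nodes without entries have degree 0).
    """
    def neighbors(n):
        # Assumption: neighbors are in-neighbors and out-neighbors combined.
        return set(in_neighbors.get(n, ())) | set(out_neighbors.get(n, ()))

    return [
        len(set(in_neighbors.get(u, ()))),
        len(set(out_neighbors.get(u, ()))),
        len(set(in_neighbors.get(v, ()))),
        len(set(out_neighbors.get(v, ()))),
        len(neighbors(u) & neighbors(v)),
    ]
```

The time-based and history-based features would be appended to this list in the
same way to form the full 12-dimensional instance.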
Instances. We use the approach introduced in Section 4.1 to generate positive
and unlabeled instances from the dataset with different $T$ for the 100 most
active nodes, measured by the number of positive records in the whole dataset.
The statistics of training and testing instances are summarized in Table 1. In
both training and testing sets, we exclude any instance $x$ whose $x_u$ is NIL,
since we want to learn the transmission probability between two connected nodes
for diffusion models. Note that the positive instances for testing consist of
positive instances and unlabeled positive instances, while the negative
instances for testing consist of unlabeled negative instances. For each $T$, we
use the earliest 20% of instances as the training set. From the latest 80% of
instances, the testing set only contains instances related to a $p_{u,v}$ that
is trained on the training data for MFAD. Thus, the number of qualifying
testing instances is not large. Moreover, the number of unlabeled instances for
training is much larger than the number of labeled positive instances, since
the earliest 20% of the time span (about 6.6 days) is relatively short and,
when a node $u$ adopts an item $i$ at time $t(i, u)$ but $u$'s in-neighbors all
adopt $i$ before $t(i, u) - T$, $|N_{in}(u)|$ unlabeled positive instances will
be generated.

Methods.  We  include  the  following  three  methods  to  predict  activations  of 
nodes:  (1)  the  logistic  regression  directly  trained  on  positive  and  unlabeled 
instances  (LOGIST),  which  is  the  classical  approach,  (2)  our  proposed  learn¬ 
ing  framework  for  the  MFAD  model  (MFAD)  and  (3)  the  independent  cascading 
model  (IC).  Note  that  only  MFAD  and  IC  are  diffusion  models,  while  LOGIST 
is  a  classification  algorithm  only  and  cannot  be  applied  to  other  applications  on 
diffusion,  e.g.  influence  maximization  [10].  We  select  the  IC  model  instead  of  the 
LT  model  since  IC  is  also  a  probabilistic  diffusion  model.  The  parameters  of  IC 
are  inferred  by  the  maximum-likelihood  estimation  conducted  in  the  same  app¬ 
roach  of  Section  4.2.  For  two  connected  nodes  u  and  v,  the  influence  probability 

I  £>+  I 

pu>v  is  +  |  for  IC.  While  training  the  nodes’  classifiers  for  both  LOGIST 

and MFAD, the class imbalance problem is encountered, i.e. a skewed class
distribution. We use SMOTE [4], which doubles the size of the minority class, and
then apply SpreadSubsample [8] to undersample instances of the majority class
to balance the class distribution. Moreover, we set the time constraints $\Delta^+$ and $\Delta^-$
in Eq. (3) and (4) to the same value as T. All methods are implemented in Java
with Weka [8] and executed on a PC with an Intel i7 3.4 GHz CPU. A single
run of MFAD for a given T takes no more than 2 hours, including the time
for sampling, training the classifiers of 100 nodes, and learning the related transmission
probabilities.
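
The two-step balancing idea (oversample the minority class, then undersample the majority class) can be illustrated with a simplified sketch. It is not Weka's SMOTE or SpreadSubsample implementation: the function names, the neighbor count `k`, and the choice to shrink the majority class to exactly the boosted minority size are assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

Point = List[float]


def smote_double(minority: List[Point], k: int = 5,
                 rng: Optional[random.Random] = None) -> List[Point]:
    """Add one synthetic point per minority point (doubling the class) by
    interpolating toward a randomly chosen one of its k nearest neighbors."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for i, x in enumerate(minority):
        others = [y for j, y in enumerate(minority) if j != i]
        if not others:
            break  # a single point has no neighbor to interpolate with
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda y: sum((a - b) ** 2 for a, b in zip(x, y)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return list(minority) + synthetic


def undersample(majority: List[Point], target_size: int,
                rng: Optional[random.Random] = None) -> List[Point]:
    """Randomly keep at most target_size majority points."""
    if rng is None:
        rng = random.Random(0)
    if len(majority) <= target_size:
        return list(majority)
    return rng.sample(majority, target_size)


def balance(minority: List[Point], majority: List[Point],
            seed: int = 0) -> Tuple[List[Point], List[Point]]:
    """Oversample the minority class, then shrink the majority class to match."""
    rng = random.Random(seed)
    boosted = smote_double(minority, rng=rng)
    reduced = undersample(majority, len(boosted), rng=rng)
    return boosted, reduced
```

Because the synthetic points lie on segments between existing minority points, they stay inside the region the minority class already occupies rather than duplicating instances exactly.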

Metrics. We use four metrics, precision, recall, F-Measure and accuracy, to
measure the results of activation prediction, based on true positive (TP), false
positive (FP), true negative (TN) and false negative (FN) instances. Precision
is $\frac{TP}{TP+FP}$. Recall is $\frac{TP}{TP+FN}$. F-Measure is
$\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$, and accuracy is
defined as $\frac{TP+TN}{TP+FP+TN+FN}$.
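
As a small worked illustration, the four metrics can be computed from the confusion counts with the standard definitions. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this sketch.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, F-Measure and accuracy from confusion counts.

    A ratio with a zero denominator is reported as 0.0 (a convention chosen here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }
```

Note that F-Measure never uses TN, whereas accuracy does, so the two can diverge sharply when negative instances dominate.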
Fig. 2. Overall Results
Fig. 3. Results of Components of F-Measure


5.2  Results 

The overall results of activation prediction are shown in Fig. 2. MFAD
outperforms the other two methods significantly on the overall metrics, F-Measure
and accuracy. F-Measure in Fig. 2(a) mainly concerns true positive, false
positive and false negative instances, while accuracy in Fig. 2(b) also takes true
negative instances into consideration. F-Measure is suitable for the scenario of
retrieving activated nodes, whereas accuracy is more suitable for the scenario of
spread estimation. MFAD performs well in both scenarios. Among the components of
F-Measure, MFAD is very effective in precision, as shown in Fig. 3(a), which means
that its number of false positive instances is much smaller than those of LOGIST and
IC. The recalls of MFAD and LOGIST are close to each other but much better
than that of IC, as shown in Fig. 3(b). Moreover, as T increases, F-Measure and
accuracy improve for all methods, since there are fewer unlabeled positive instances
and thus more labeled positive instances are available for training the classifiers.

In summary, MFAD is the best of the three methods at predicting the activation
of nodes. Most importantly, MFAD is a diffusion model and therefore
can simulate how information diffuses, whereas LOGIST cannot. IC is also a
diffusion model, but it cannot reflect the state of an exposure precisely and thus
does not model the diffusion well. Although there is an extension of IC in [1],
called TIC, that considers the topic factor, the dataset does not contain the detailed
textual information of tweets, and hence we do not include TIC in the experiment.
Nevertheless, our MFAD model can easily consider the topic factor by defining new
features for the nodes' classifiers, which does not require any modification
of the learning framework.
6  Conclusions

In this work, we propose the Multiple-Factors Aware Diffusion (MFAD) model,
which explicitly models influence transmission and adoption decisions
and flexibly considers multiple factors that may affect adoption behaviors. The
learning framework of MFAD is independent of the factors considered and is effective,
as shown in the experiment. Therefore, MFAD is more flexible and can easily be
applied to different scenarios for different purposes. In the future, we
will design influence maximization algorithms for MFAD.